feat(bench): token-reduction benchmark harness over frozen corpus (MCP-42) #747
…P-42)

Ship the first, fully-deterministic slice of the roadmap-#19 benchmark: the token-reduction numbers behind the "massive token savings" claim. Reuses the frozen Spec 065 tool corpus (45 tools, 7 reference servers) as a versioned, non-drifting universe and tiktoken cl100k_base (already a dep) as a reproducible model-agnostic estimator.

Compares the three routing modes' static context cost over the corpus and reports per-mode savings:
- baseline (all upstream tools loaded directly)
- retrieve_tools (BM25 discovery + call_tool variants)
- code_execution (orchestration + retrieve_tools)

Real proxy tool defs are captured verbatim from internal/server/mcp.go into bench/proxy_tools_v1.json (provenance recorded). Emits report.json + a self-contained dashboard.html (gitignored; reports never committed, per Spec 065 CN-003).

Conservative by construction: excluding input schemas uniformly understates the baseline, so measured savings (65.5% / 70.3% on the 45-tool corpus) are a floor. Methodology, limitations, and the scoped-but-not-yet-built follow-ups (live run with full schemas + accuracy/latency, LLM e2e, CI publish) are in bench/README.md.

Related #MCP-42
Co-Authored-By: Paperclip <noreply@paperclip.ing>
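The per-mode savings described in this commit reduce to simple arithmetic: sum the token cost of every tool a mode exposes, then compare it with the baseline. The following is a minimal sketch of that calculation, not the harness code itself. The `Tool` type, the function names, and the sample tools are hypothetical, and a whitespace counter stands in for tiktoken cl100k_base so the example runs with the standard library only.

```go
package main

import (
	"fmt"
	"strings"
)

// Tool is a hypothetical name+description record, mirroring what the static
// benchmark counts (input schemas are excluded).
type Tool struct {
	Name        string
	Description string
}

// TokenCounter abstracts the estimator. The harness uses tiktoken
// cl100k_base; this sketch plugs in a simpler counter instead.
type TokenCounter func(text string) int

// whitespaceTokens is a stand-in estimator (NOT cl100k_base): it counts
// whitespace-separated fields.
func whitespaceTokens(text string) int {
	return len(strings.Fields(text))
}

// contextCost sums the token cost of every tool a mode exposes.
func contextCost(tools []Tool, count TokenCounter) int {
	total := 0
	for _, t := range tools {
		total += count(t.Name) + count(t.Description)
	}
	return total
}

// savings returns the fraction of baseline tokens a mode avoids:
// 1 - modeTokens/baselineTokens. It returns 0 for an empty baseline.
func savings(baselineTokens, modeTokens int) float64 {
	if baselineTokens <= 0 {
		return 0
	}
	return 1 - float64(modeTokens)/float64(baselineTokens)
}

func main() {
	// Illustrative tools only; the real corpus is the frozen Spec 065 file.
	baseline := []Tool{
		{Name: "create_issue", Description: "Create a new issue in a repository"},
		{Name: "list_issues", Description: "List issues in a repository"},
		{Name: "read_file", Description: "Read the contents of a file"},
	}
	proxy := []Tool{
		{Name: "retrieve_tools", Description: "Search for tools by query"},
	}

	b := contextCost(baseline, whitespaceTokens)
	p := contextCost(proxy, whitespaceTokens)
	fmt.Printf("baseline=%d proxy=%d savings=%.1f%%\n", b, p, 100*savings(b, p))
}
```

Because the proxy-mode cost is roughly fixed while the baseline grows with every upstream tool, the savings fraction in this model rises with corpus size, which is why the choice of which proxy tools are counted matters so much for the headline figure.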
Deploying mcpproxy-docs with Cloudflare Pages
| Latest commit: | 9a92d71 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://5553fbac.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://feat-mcp-42-bench-harness.mcpproxy-docs.pages.dev |
📦 Build Artifacts
Download via GitHub CLI: `gh run download 27971074417 --repo smart-mcp-proxy/mcpproxy-go`
… smoke test

KimiReviewer finding 2: code_execution is at line 626 in mcp.go at 89f06b5, not 675 as claimed. Line numbers drift with unrelated edits and the actual function names are the stable identifier, so remove all line numbers from the provenance comment to prevent future rot.

KimiReviewer finding 3: add TestWriteReports_SmokeTest covering WriteReports output (JSON round-trips to Report, HTML is non-empty and contains all mode names).

All 5 tests pass; golangci-lint v2 clean.

Related #MCP-42
Co-Authored-By: Paperclip <noreply@paperclip.ing>
✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).
This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.
Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).
CodexReviewer: changes requested (benchmark integrity). The fixture (…
…cl. management tools (MCP-3161)

The token-reduction benchmark scored only 6 hand-maintained proxy tools and omitted the shared management tool set (upstream_servers, quarantine_security, search_servers, list_registries) that both routing modes append via buildManagementTools. That undercounted the proxy-mode context cost and inflated the headline savings (Codex finding on PR #747).

Replace bench/proxy_tools_v1.json with server.ProxyModeToolDefs, which builds the catalog from the live builders (buildCallToolModeTools / buildCodeExecModeTools in internal/server/mcp_routing.go) so it can never drift from production and always reflects the tools the agent actually sees. This also fixes a second drift: the fixture's retrieve_tools descriptions did not match the per-mode builder descriptions.

Corrected figures over the 45-tool Spec 065 corpus (name+description only): retrieve_tools ~17% (10 tools), code_execution ~43% (6 tools). Updated README and notes; the schema-exclusion claim is no longer unambiguously conservative now that large-schema management tools are in the proxy cost.

Tests: bench asserts both modes include the 4 management tools; internal/server pins ProxyModeToolDefs to the builders so the catalog can't silently drift.

Related #747
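The guard described in this commit (asserting that each mode's catalog includes the four management tools) can be sketched as a small set-difference check. This is an illustrative stand-in, not the repository's test: `missingManagementTools` is a hypothetical helper, and the sample catalogs in `main` are made up. Only the four management tool names come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// managementTools lists the shared management tool names given in the
// commit message above.
var managementTools = []string{
	"upstream_servers",
	"quarantine_security",
	"search_servers",
	"list_registries",
}

// missingManagementTools (hypothetical helper) returns the management tools
// absent from a mode's exposed tool names, sorted for stable output.
func missingManagementTools(exposed []string) []string {
	seen := make(map[string]bool, len(exposed))
	for _, name := range exposed {
		seen[name] = true
	}
	var missing []string
	for _, name := range managementTools {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Illustrative catalogs: a stale fixture without management tools and
	// a catalog that includes them.
	stale := []string{"retrieve_tools", "code_execution"}
	complete := append([]string{"retrieve_tools"}, managementTools...)

	fmt.Println("stale fixture is missing:", missingManagementTools(stale))
	fmt.Println("complete catalog is missing:", missingManagementTools(complete))
}
```

A check like this catches an omitted tool but not a changed description or schema, which is presumably why the commit also pins the catalog to the live builders rather than relying on a name list alone.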
✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).
This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.
Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).
✅ Gatekeeper approval — MCP-42 benchmark harness on corrected head 9a92d71. Full mandated gate satisfied: CodexReviewer (first review) caught inflated savings (fixture omitted management tools); BackendEngineer fixed it (derive per-mode catalog from live server builders); KimiReviewer ACCEPT (model-diverse) + QATester PASS (MCP-3162) on this head + operator-verified. Honest numbers now: retrieve_tools ~17%, code_execution ~43% (were 65.5/70.3). CI green. Author≠approver.
…MCP-42a) (#748)

* feat(bench): live benchmark run - full schemas + Recall@k + latency (MCP-42a)

  Extends the bench/ harness (PR #747) with a live run against a running proxy:
  - Exact token number: GET /api/v1/tools pulls upstream tools WITH full JSON input schemas; proxy-mode tools carry their live schemas via the extended server.ProxyModeToolDefs (BenchProxyToolDef.Schema). Schemas are counted on BOTH sides so the headline savings is authoritative, and it is withheld (authoritative_headline=false) if any proxy tool lacks a schema (the MCP-3161 overstatement guard).
  - Accuracy: replays the Spec 065 retrieval golden set through the proxy BM25 search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}/MRR/nDCG@10/MAP against graded labels (deterministic, no LLM). Field names mirror Spec 065 score-report.schema.json.
  - Latency: client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot load-all-tools cost (server "took" is a 0ms stub).

  CLI: `go run ./bench/cmd/bench -live -proxy URL -api-key KEY`. Reports stay gitignored (CN-003). All metric math + the live client are unit-tested with httptest stubs; the docker-compose substrate is the live-reproduction path.

  Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(bench): preserve upstream schemas through /api/v1/tools baseline

  ConvertGenericToolsToTyped read generic["schema"], but every producer of the generic tool map (runtime/server GetServerTools, mcp.go) emits the upstream input schema under "inputSchema". The /api/v1/tools response therefore dropped every schema, so the MCP-42a live benchmark baseline was silently a description-only token count instead of the required full-schema count, while still able to emit authoritative_headline=true.

  - Read "inputSchema" first in the converter, keep "schema" as a legacy fallback.
  - Gate the live headline on baseline schemas too (BaselineSchemasCounted via anyHaveSchema): a systematically schema-less baseline now withholds the headline instead of claiming a full-schema baseline it never had.
  - Tests: converter preserves inputSchema (+legacy schema fallback); headline withheld when the baseline carries no schemas.

  Related #748

* fix(bench): conform live retrieval report to Spec 065 score-report schema

  Addresses CodexReviewer finding on PR #748 / MCP-3167: the live `retrieval` payload emitted flat metric fields, but score-report.schema.json requires nested `retrieval.metrics` + `retrieval.gate`. Restructure RetrievalMetrics into {metrics, gate} so live_report.json validates against the contract, proven by a new jsonschema-validation test (TestRetrievalMetricsConformsToScoreReportSchema). A standalone live run has no stored baseline, so gate.passed is true by construction (CI regression-gating against a committed baseline is MCP-3133).

  Co-Authored-By: Paperclip <noreply@paperclip.ing>

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
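For readers unfamiliar with the accuracy metrics named in the live-run commit, here is a minimal sketch of Recall@k and MRR over ranked search results. It assumes binary relevance (a tool is either relevant or not) and unique tool IDs in each ranking, whereas the harness scores against graded labels, so treat it as an illustration of the definitions rather than the harness's metric code. The `Query` type, the function names, and the sample data are hypothetical.

```go
package main

import "fmt"

// Query pairs a ranked result list with the set of relevant tool IDs.
// Both the type and its fields are illustrative assumptions.
type Query struct {
	Ranked   []string
	Relevant map[string]struct{}
}

// recallAtK returns the fraction of relevant tools found in the top k
// results. It returns 0 when there are no relevant tools or k <= 0.
func recallAtK(q Query, k int) float64 {
	if len(q.Relevant) == 0 || k <= 0 {
		return 0
	}
	if k > len(q.Ranked) {
		k = len(q.Ranked)
	}
	hits := 0
	for _, id := range q.Ranked[:k] {
		if _, ok := q.Relevant[id]; ok {
			hits++
		}
	}
	return float64(hits) / float64(len(q.Relevant))
}

// reciprocalRank returns 1/rank of the first relevant result (1-based),
// or 0 if no relevant result appears in the ranking.
func reciprocalRank(q Query) float64 {
	for i, id := range q.Ranked {
		if _, ok := q.Relevant[id]; ok {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// meanReciprocalRank averages reciprocalRank over all queries.
func meanReciprocalRank(queries []Query) float64 {
	if len(queries) == 0 {
		return 0
	}
	sum := 0.0
	for _, q := range queries {
		sum += reciprocalRank(q)
	}
	return sum / float64(len(queries))
}

func main() {
	// Made-up rankings and labels for illustration.
	queries := []Query{
		{Ranked: []string{"a", "b", "c"}, Relevant: map[string]struct{}{"b": {}, "d": {}}},
		{Ranked: []string{"x", "y"}, Relevant: map[string]struct{}{"x": {}}},
	}
	for _, k := range []int{1, 3, 5, 10} {
		fmt.Printf("Recall@%d (first query) = %.2f\n", k, recallAtK(queries[0], k))
	}
	fmt.Printf("MRR = %.2f\n", meanReciprocalRank(queries))
}
```

Both metrics are deterministic functions of the ranking and the labels, which is what lets the live run score retrieval without an LLM in the loop.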
What
First, fully-deterministic slice of the roadmap-#19 benchmark harness (MCP-42): the token-reduction numbers behind mcpproxy's "massive token savings" claim. In-repo under `bench/` (per board decision: no separate public repo).

Compares the static context-token cost of the three routing modes over a frozen tool corpus:
- `baseline` (all tools loaded)
- `retrieve_tools` (BM25 discovery)
- `code_execution` (orchestration)

The reported savings are a conservative floor: input schemas are excluded uniformly (the committed corpus has none), which understates the baseline; and savings scale with tool count (real deployments expose hundreds to thousands of tools).

How
- Reuses the frozen Spec 065 tool corpus (`specs/065-evaluation-foundation/datasets/corpus_v1.tools.json`) as a versioned, non-drifting universe (CN-002).
- `tiktoken cl100k_base`: already a repo dependency; a reproducible, model-agnostic estimator. No new deps.
- Real proxy tool defs are captured verbatim from `internal/server/mcp.go` into `bench/proxy_tools_v1.json` (provenance recorded in-file).
- `go run ./bench/cmd/bench` → `report.json` + self-contained `dashboard.html` in `bench/results/` (gitignored; reports never committed per Spec 065 CN-003).
- Methodology and limitations are in `bench/README.md`.

Tests
`go test ./bench/`: deterministic tokenizer, per-mode tool exposure, real savings in (0,1), baseline monotonicity. Race-clean. `gofmt`, `go vet`, and golangci-lint v2 (strict CI config) all clean.

Scoped but NOT in this PR (tracked as follow-ups)
These need decisions / other lanes, so they're deliberately deferred (see `bench/README.md`):
- Live run via `GET /api/v1/tools` for the exact headline number + Recall@k accuracy (reusing the Spec 065 retrieval golden set) + latency.

Related #MCP-42